{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zAKy_OPy81EZ"
   },
   "source": [
    "# Tutorial: Preprocessing Different File Types\n",
    "\n",
    "- **Level**: Beginner\n",
    "- **Time to complete**: 15 minutes\n",
    "- **Goal**: After completing this tutorial, you'll have learned how to build an indexing pipeline that will preprocess files based on their file type, using the `FileTypeRouter`.\n",
    "\n",
    "> This tutorial uses Haystack 2.0. To learn more, read the [Haystack 2.0 announcement](https://haystack.deepset.ai/blog/haystack-2-release) or visit the [Haystack 2.0 Documentation](https://docs.haystack.deepset.ai/docs/intro).\n",
    "\n",
    "> 💡 (Optional): After creating the indexing pipeline in this tutorial, there is an optional section that shows you how to create a RAG pipeline on top of the document store you just created. You must have a [Hugging Face API Key](https://huggingface.co/settings/tokens) for this section\n",
    "\n",
    "## Components Used\n",
    "\n",
    "- [`FileTypeRouter`](https://docs.haystack.deepset.ai/docs/filetyperouter): This component will help you route files based on their corresponding MIME type to different components\n",
    "- [`MarkdownToDocument`](https://docs.haystack.deepset.ai/docs/markdowntodocument): This component will help you convert markdown files into Haystack Documents\n",
    "- [`PyPDFToDocument`](https://docs.haystack.deepset.ai/docs/pypdftodocument): This component will help you convert pdf files into Haystack Documents\n",
    "- [`TextFileToDocument`](https://docs.haystack.deepset.ai/docs/textfiletodocument): This component will help you convert text files into Haystack Documents\n",
    "- [`DocumentJoiner`](https://docs.haystack.deepset.ai/docs/documentjoiner): This component will help you to join Documents coming from different branches of a pipeline\n",
    "- [`DocumentCleaner`](https://docs.haystack.deepset.ai/docs/documentcleaner) (optional): This component will help you to make Documents more readable by removing extra whitespaces etc.\n",
    "- [`DocumentSplitter`](https://docs.haystack.deepset.ai/docs/documentsplitter): This component will help you to split your Document into chunks\n",
    "- [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder): This component will help you create embeddings for Documents.\n",
    "- [`DocumentWriter`](https://docs.haystack.deepset.ai/docs/documentwriter): This component will help you write Documents into the DocumentStore"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "96w6PrcPk4Fc"
   },
   "source": [
    "## Overview\n",
    "\n",
    "In this tutorial, you'll build an indexing pipeline that preprocesses different types of files (markdown, txt and pdf). Each file will have its own `FileConverter`. The rest of the indexing pipeline is fairly standard - split the documents into chunks, trim whitespace, create embeddings and write them to a Document Store.\n",
    "\n",
    "Optionally, you can keep going to see how to use these documents in a query pipeline as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rns_B_NGN0Ze"
   },
   "source": [
    "## Preparing the Colab Environment\n",
    "\n",
    "- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration)\n",
    "- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/logging)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_pG2jycZLYYr"
   },
   "source": [
    "## Installing dependencies\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "2mP4empwf_x4"
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "pip install haystack-ai\n",
    "pip install \"sentence-transformers>=3.0.0\" \"huggingface_hub>=0.23.0\"\n",
    "pip install markdown-it-py mdit_plain pypdf\n",
    "pip install gdown"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HnXumz7EarJx"
   },
   "source": [
    "### Enabling Telemetry\n",
    "\n",
    "Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product but you can always opt out by commenting the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/enabling-telemetry) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "CkvJIU7FmDf9"
   },
   "outputs": [],
   "source": [
    "from haystack.telemetry import tutorial_running\n",
    "\n",
    "tutorial_running(30)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7GWbC28fX0Yp"
   },
   "source": [
    "## Download All Files\n",
    "\n",
    "Files that you will use in this tutorial are stored in a [GDrive folder](https://drive.google.com/drive/folders/1n9yqq5Gl_HWfND5bTlrCwAOycMDt5EMj). Either download files directly from the GDrive folder or run the code below. If you're running this tutorial on colab, you'll find the downloaded files under \"/recipe_files\" folder in \"files\" tab on the left.\n",
    "\n",
    "Just like most real life data, these files are a mishmash of different types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gdown\n",
    "\n",
    "url = \"https://drive.google.com/drive/folders/1n9yqq5Gl_HWfND5bTlrCwAOycMDt5EMj\"\n",
    "output_dir = \"recipe_files\"\n",
    "\n",
    "gdown.download_folder(url, quiet=True, output=output_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DH8HEymp6XFZ"
   },
   "source": [
    "## Create a Pipeline to Index Documents\n",
    "\n",
    "Next, you'll create a pipeline to index documents. To keep things uncomplicated, you'll use an `InMemoryDocumentStore` but this approach would also work with any other flavor of `DocumentStore`.\n",
    "\n",
    "You'll need a different file converter class for each file type in our data sources: `.pdf`, `.txt`, and `.md` in this case. Our `FileTypeRouter` connects each file type to the proper converter.\n",
    "\n",
    "Once all our files have been converted to Haystack Documents, we can use the `DocumentJoiner` component to make these a single list of documents that can be fed through the rest of the indexing pipeline all together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "c_eM4C1cA4j6"
   },
   "outputs": [],
   "source": [
    "from haystack.components.writers import DocumentWriter\n",
    "from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument\n",
    "from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner\n",
    "from haystack.components.routers import FileTypeRouter\n",
    "from haystack.components.joiners import DocumentJoiner\n",
    "from haystack.components.embedders import SentenceTransformersDocumentEmbedder\n",
    "from haystack import Pipeline\n",
    "from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
    "\n",
    "document_store = InMemoryDocumentStore()\n",
    "file_type_router = FileTypeRouter(mime_types=[\"text/plain\", \"application/pdf\", \"text/markdown\"])\n",
    "text_file_converter = TextFileToDocument()\n",
    "markdown_converter = MarkdownToDocument()\n",
    "pdf_converter = PyPDFToDocument()\n",
    "document_joiner = DocumentJoiner()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ef8okackBSmk"
   },
   "source": [
    "From there, the steps to this indexing pipeline are a bit more standard. The `DocumentCleaner` removes whitespace. Then this `DocumentSplitter` breaks them into chunks of 150 words, with a bit of overlap to avoid missing context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "hCWlpiQCBYOg"
   },
   "outputs": [],
   "source": [
    "document_cleaner = DocumentCleaner()\n",
    "document_splitter = DocumentSplitter(split_by=\"word\", split_length=150, split_overlap=50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Do4nhM4tBaZL"
   },
   "source": [
    "Now you'll add a `SentenceTransformersDocumentEmbedder` to create embeddings from the documents. As the last step in this pipeline, the `DocumentWriter` will write them to the `InMemoryDocumentStore`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "TVXSX0GHBtdj"
   },
   "outputs": [],
   "source": [
    "document_embedder = SentenceTransformersDocumentEmbedder(model=\"sentence-transformers/all-MiniLM-L6-v2\")\n",
    "document_writer = DocumentWriter(document_store)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hJaJmGanBv1b"
   },
   "source": [
    "After creating all the components, add them to the indexing pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4yGXKHEXIZxi"
   },
   "outputs": [],
   "source": [
    "preprocessing_pipeline = Pipeline()\n",
    "preprocessing_pipeline.add_component(instance=file_type_router, name=\"file_type_router\")\n",
    "preprocessing_pipeline.add_component(instance=text_file_converter, name=\"text_file_converter\")\n",
    "preprocessing_pipeline.add_component(instance=markdown_converter, name=\"markdown_converter\")\n",
    "preprocessing_pipeline.add_component(instance=pdf_converter, name=\"pypdf_converter\")\n",
    "preprocessing_pipeline.add_component(instance=document_joiner, name=\"document_joiner\")\n",
    "preprocessing_pipeline.add_component(instance=document_cleaner, name=\"document_cleaner\")\n",
    "preprocessing_pipeline.add_component(instance=document_splitter, name=\"document_splitter\")\n",
    "preprocessing_pipeline.add_component(instance=document_embedder, name=\"document_embedder\")\n",
    "preprocessing_pipeline.add_component(instance=document_writer, name=\"document_writer\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "y89Z9jwUfNbr"
   },
   "source": [
    "Next, connect them 👇"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "id": "gafXWtNYfNbr",
    "outputId": "10f351de-ac09-4273-85a2-ac7b59fb2f77"
   },
   "outputs": [],
   "source": [
    "preprocessing_pipeline.connect(\"file_type_router.text/plain\", \"text_file_converter.sources\")\n",
    "preprocessing_pipeline.connect(\"file_type_router.application/pdf\", \"pypdf_converter.sources\")\n",
    "preprocessing_pipeline.connect(\"file_type_router.text/markdown\", \"markdown_converter.sources\")\n",
    "preprocessing_pipeline.connect(\"text_file_converter\", \"document_joiner\")\n",
    "preprocessing_pipeline.connect(\"pypdf_converter\", \"document_joiner\")\n",
    "preprocessing_pipeline.connect(\"markdown_converter\", \"document_joiner\")\n",
    "preprocessing_pipeline.connect(\"document_joiner\", \"document_cleaner\")\n",
    "preprocessing_pipeline.connect(\"document_cleaner\", \"document_splitter\")\n",
    "preprocessing_pipeline.connect(\"document_splitter\", \"document_embedder\")\n",
    "preprocessing_pipeline.connect(\"document_embedder\", \"document_writer\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3NGinkHPB9C2"
   },
   "source": [
    "Let's test this pipeline with a few recipes I've written. Are you getting hungry yet?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9Mw5kwZiqehc"
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "preprocessing_pipeline.run({\"file_type_router\": {\"sources\": list(Path(output_dir).glob(\"**/*\"))}})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TVGb-rteg7E5"
   },
   "source": [
    "🎉 If you only wanted to learn how to preprocess documents, you can stop here! If you want to see an example of using those documents in a RAG pipeline, read on.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "16PnegDR2EmY"
   },
   "source": [
    "## (Optional) Build a pipeline to query documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "I06qdWsZibSz"
   },
   "source": [
    "Now, let's build a RAG pipeline that answers queries based on the documents you just created in the section above. For this step, we will be using the [`HuggingFaceAPIGenerator`](https://docs.haystack.deepset.ai/docs/huggingfaceapigenerator) so must have a [Hugging Face API Key](https://huggingface.co/settings/tokens) for this section. We will be using the `HuggingFaceH4/zephyr-7b-beta` model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "bB344ADZr-eG",
    "outputId": "b6030405-5def-4700-8124-2e7ec292e977"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "if \"HF_API_TOKEN\" not in os.environ:\n",
    "    os.environ[\"HF_API_TOKEN\"] = getpass(\"Enter Hugging Face token:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QASEGJhnIMQL"
   },
   "source": [
    "In this step you'll build a query pipeline to answer questions about the documents.\n",
    "\n",
    "This pipeline takes the prompt, searches the document store for relevant documents, and passes those documents along to the LLM to formulate an answer.\n",
    "\n",
    "> ⚠️ Notice how we used `sentence-transformers/all-MiniLM-L6-v2` to create embeddings for our documents before. This is why we will be using the same model to embed incoming questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "id": "_s--8xEWq8Y9",
    "outputId": "1c050d5f-f2ae-4cd3-e0d4-533397a6af63"
   },
   "outputs": [],
   "source": [
    "from haystack.components.embedders import SentenceTransformersTextEmbedder\n",
    "from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever\n",
    "from haystack.components.builders import PromptBuilder\n",
    "from haystack.components.generators import HuggingFaceAPIGenerator\n",
    "\n",
    "template = \"\"\"\n",
    "Answer the questions based on the given context.\n",
    "\n",
    "Context:\n",
    "{% for document in documents %}\n",
    "    {{ document.content }}\n",
    "{% endfor %}\n",
    "\n",
    "Question: {{ question }}\n",
    "Answer:\n",
    "\"\"\"\n",
    "pipe = Pipeline()\n",
    "pipe.add_component(\"embedder\", SentenceTransformersTextEmbedder(model=\"sentence-transformers/all-MiniLM-L6-v2\"))\n",
    "pipe.add_component(\"retriever\", InMemoryEmbeddingRetriever(document_store=document_store))\n",
    "pipe.add_component(\"prompt_builder\", PromptBuilder(template=template))\n",
    "pipe.add_component(\n",
    "    \"llm\",\n",
    "    HuggingFaceAPIGenerator(api_type=\"serverless_inference_api\", api_params={\"model\": \"HuggingFaceH4/zephyr-7b-beta\"}),\n",
    ")\n",
    "\n",
    "pipe.connect(\"embedder.embedding\", \"retriever.query_embedding\")\n",
    "pipe.connect(\"retriever\", \"prompt_builder.documents\")\n",
    "pipe.connect(\"prompt_builder\", \"llm\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1ba5puJxIg3V"
   },
   "source": [
    "Try it out yourself by running the code below. If all has gone well, you should have a complete shopping list from all the recipe sources. 🧂🥥🧄"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qDqrU5emtBWQ"
   },
   "outputs": [],
   "source": [
    "question = (\n",
    "    \"What ingredients would I need to make vegan keto eggplant lasagna, vegan persimmon flan, and vegan hemp cheese?\"\n",
    ")\n",
    "\n",
    "pipe.run(\n",
    "    {\n",
    "        \"embedder\": {\"text\": question},\n",
    "        \"prompt_builder\": {\"question\": question},\n",
    "        \"llm\": {\"generation_kwargs\": {\"max_new_tokens\": 350}},\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZJueu_V4KP6w"
   },
   "source": [
    "```python\n",
    "{'llm': {'replies': [\"\\n\\nVegan Keto Eggplant Lasagna:\\n\\nIngredients:\\n- 2 large eggplants\\n- A lot of salt (you should have this in your house already)\\n- 1/2 cup store-bought vegan mozzarella (for topping)\\n\\nPesto:\\n- 4 oz basil (generally one large clamshell or 2 small ones)\\n- 1/4 cup almonds\\n- 1/4 cup nutritional yeast\\n- 1/4 cup olive oil\\n- 1 recipe vegan pesto (you can find this in the recipe)\\n- 1 recipe spinach tofu ricotta (you can find this in the recipe)\\n- 1 tsp garlic powder\\n- Juice of half a lemon\\n- Salt to taste\\n\\nSpinach Tofu Ricotta:\\n- 10 oz firm or extra firm tofu\\n- Juice of 1 lemon\\n- Garlic powder to taste\\n- Salt to taste\\n\\nInstructions:\\n1. Slice the eggplants into 1/4 inch thick slices. Some slices will need to be scrapped because it's difficult to get them all uniformly thin. Use them in soup or something, IDK, man.\\n2. Take the eggplant slices and rub both sides with salt. Don't be shy about how much, you're gonna rinse it off anyway.\\n3. Put them in a colander with something underneath it and let them sit for half an hour. This draws the water out so that the egg\"],\n",
    "  'meta': [{'model': 'HuggingFaceH4/zephyr-7b-beta',\n",
    "    ...\n",
    "    }]}}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zA7xHckYJHsG"
   },
   "source": [
    "## What's next\n",
    "\n",
    "Congratulations on building an indexing pipeline that can preprocess different file types. Go forth and ingest all the messy real-world data into your workflows. 💥\n",
    "\n",
    "If you liked this tutorial, you may also enjoy:\n",
    "- [Serializing Haystack Pipelines](https://haystack.deepset.ai/tutorials/29_serializing_pipelines)\n",
    "-  [Creating Your First QA Pipeline with Retrieval-Augmentation](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline)\n",
    "\n",
    "To stay up to date on the latest Haystack developments, you can [sign up for our newsletter](https://landing.deepset.ai/haystack-community-updates). Thanks for reading!"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
